Abstract

We propose a new action and gesture recognition method 
based on spatio-temporal covariance descriptors and a 
weighted Riemannian locality preserving projection ap- 
proach that takes into account the curved space formed by 
the descriptors. The weighted projection is then exploited 
during boosting to create a final multiclass classification 
algorithm that employs the most useful spatio-temporal re- 
gions. We also show how the descriptors can be computed 
quickly through the use of integral video representations. 
Experiments on the UCF sport, CK+ facial expression and 
Cambridge hand gesture datasets indicate superior perfor- 
mance of the proposed method compared to several recent 
state-of-the-art techniques. The proposed method is robust 
and does not require additional processing of the videos, 
such as foreground detection, interest-point detection or 
tracking. 

1. Introduction 

Video-based classification plays a key role in human 
motion analysis fields such as action and gesture recog- 
nition. Both fields have shown promising applications in 
many areas, including security and surveillance, content- 
based video analysis, human-computer interaction and ani- 
mation. According to a recent survey on recognition of hu- 
man activities [28], the focus has shifted to methods that do 
not rely on human body models; instead, the information is
extracted directly from the images, making these methods less
dependent on reliable segmentation and tracking algorithms.
Such image representation methods can be categorised into 
global and local based approaches [22]. 

Methods with global image representation encode visual 
information as a whole. Ali and Shah [1] extract a series 
of kinematic features based on optical flow. A group of 
kinematic modes is found using principal component anal- 
ysis. Guo et al. [6] encode the same kinematic features us- 
ing sparse representation of covariance matrices. Several 
methods first divide the region of interest into a fixed spatial 
or temporal grid, extract features inside each cell and then 
combine them into a global representation. For example, 
this can be achieved using local binary patterns (LBP) [11],
or histograms of oriented gradients (HOG) [27]. Global rep- 
resentations are sensitive to viewpoint, noise and occlusions 
which may lead to unreliable classification. Furthermore, 
global representations depend on reliable localisation of the 
region of interest [22]. 

Local representations are designed to deal with the 
abovementioned issues by describing the visual information 
as a collection of patches, usually at the cost of increased 
computation. Laptev and Lindeberg [14] extract interest 
points using a 3D Harris corner detector and use the points 
for modelling the actions. One of the major drawbacks is 
the low number of interest points that are able to remain 
stable across an image sequence. A common solution is to 
work with windowed data, extracting salient regions which 
can be represented using Gabor filtering [4]. 

Wang et al. [31] showed that dense sampling approaches 
tend to perform better compared to interest point based ap- 
proaches. Dense sampling is typically done for a set of 
patches inside the region of interest. Features are extracted 
from each patch to form a descriptor. These descriptor rep- 
resentations differ from grid-based global representations in 
that they can have an arbitrary position and size, and that 
the patches are not combined to form a single representa- 
tion but form a set of multiple representations. Examples 
are HOG and HOF (histogram of oriented flow) descriptors [15],
SIFT descriptors [17], and their respective spatio-temporal
versions, HOG3D [31] and 3D SIFT [26]. Because of the likely
large number of descriptors and/or their
high dimensionality, comparing sets of descriptors is often 
not straightforward. This has led to compressed represen- 
tations such as formulating sets of descriptors as bags-of- 
words [21]. 

In this paper we propose the use of spatio-temporal 
covariance descriptors for action and gesture recognition 
tasks. Flat region covariance descriptors were first pro- 
posed for the task of object detection and classification in 
images [29]. Each covariance descriptor represents the features
inside an image region as a normalised covariance matrix.
They have led to improved results over related descriptors
such as HOG, in terms of detection performance as well
as robustness to translation and scale [29]. Furthermore, co- 
variance matrices provide a low dimensional representation 
which enables efficient comparison between sets of covari- 
ance descriptors. 



The proposed spatio-temporal descriptors, which we 
name Cov3D, belong to the group of symmetric positive 
definite matrices which do not form a vector space. They 
can be formulated as a connected Riemannian manifold, 
and taking into account the non-linear nature of the space of 
the descriptors may lead to improved classification results. 
The most common approach for classification on manifolds 
is to first map the points into an appropriate Euclidean rep- 
resentation [16] and then use traditional machine learning 
methods. A recent example of mapping is the Riemannian 
locality preserving projection (RLPP) technique [8]. 

The Cov3D descriptors are extracted from spatio- 
temporal windows inside sample videos, with the number 
of possible windows being very large. As such, we use 
a boosting approach to search the windows to find a sub- 
set which is the most useful for classification. We pro- 
pose to extend RLPP by weighting (WRLPP), in order to 
take into account the weights of the training samples. This 
weighted projection leads to a better representation of the 
neighbourhoods around the most critical training samples 
during each boosting iteration. The proposed Cov3D de- 
scriptors, in conjunction with the classification approach 
based on WRLPP boosting, lead to a state-of-the-art method 
for action and gesture recognition. 

We continue the paper as follows. In Section 2 we de- 
scribe the spatio-temporal covariance descriptors, and use 
the concept of integral video to enable fast calculation in- 
side any spatio-temporal window. In Section 3, we first 
overview the concept of Riemannian manifolds formulated 
in the context of positive definite symmetric matrices, and 
then detail the proposed boosting classification approach 
based on weighted Riemannian locality preserving projec- 
tion. In Section 4, we compare the performance of the pro- 
posed method against several recent state-of-the-art meth- 
ods on three benchmark datasets. Concluding remarks and 
possible future directions are given in Section 5. 
Figure 1. Conceptual demonstration for obtaining a Cov3D
spatio-temporal covariance descriptor. A spatio-temporal window
R is defined inside the input video. For each pixel in R a feature
vector $z_i$ is calculated. The feature vectors are then used to
compute the covariance matrix $\mathrm{Cov3D}_R$.



2. Cov3D Descriptors 

In this section we first present the general form of the 
proposed spatio-temporal covariance descriptors (Cov3D), 
an algorithm for their fast calculation, and finally how they 
can be specialised for action and gesture recognition. For 
convenience, we follow the notation in [29].

Let V be the sequence of images and F be the
$W \times H \times T \times d$ dimensional feature video extracted from V:

$$F(x,y,t) = \Phi(V,x,y,t) \tag{1}$$

where the function $\Phi$ can be any mapping such as intensity,
colour, gradients, or optical flow. For a given spatio-temporal
window $R \subset F$, let $\{z_i\}_{i=1}^{S}$ be the d-dimensional
feature vectors inside R. The region R is represented with
the $d \times d$ covariance matrix of the feature vectors:

$$\mathrm{Cov3D}_R = \frac{1}{S-1} \sum_{i=1}^{S} (z_i - \mu)(z_i - \mu)^T \tag{2}$$

where $\mu$ is the mean of the points. Fig. 1 shows the construction
of a covariance descriptor inside a spatio-temporal
window. Examples of feature vectors specific for action and
gesture recognition are given in Section 2.2.

Representing a spatio-temporal window with a covariance
matrix has several advantages: (i) it is a low-dimensional
representation which is independent of the size of the window,
(ii) the impact of noisy samples is reduced through the
averaging during covariance computation, and (iii) it is a
straightforward method of fusing correlated features.
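As a minimal sketch of Eq. (2), the descriptor of a window can be computed directly from the feature video. The array layout `(W, H, T, d)`, the inclusive window tuple and the function name `cov3d` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def cov3d(features, window):
    """Covariance descriptor of one spatio-temporal window, following Eq. (2).

    features: array of shape (W, H, T, d), the feature video F (assumed layout).
    window:   (x1, y1, t1, x2, y2, t2) with inclusive bounds (assumed layout).
    """
    x1, y1, t1, x2, y2, t2 = window
    d = features.shape[-1]
    # Collect the S feature vectors z_i inside R as rows of an (S, d) matrix.
    z = features[x1:x2 + 1, y1:y2 + 1, t1:t2 + 1].reshape(-1, d)
    centred = z - z.mean(axis=0)
    # 1/(S-1) * sum_i (z_i - mu)(z_i - mu)^T
    return centred.T @ centred / (z.shape[0] - 1)

# Toy feature video with d = 5 random features per pixel.
rng = np.random.default_rng(0)
F = rng.normal(size=(16, 12, 10, 5))
C = cov3d(F, (2, 3, 1, 9, 8, 6))
```

The result is always a `d x d` matrix, whatever the window size, which is the size independence noted in advantage (i).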

2.1. Fast computation 

Integral images are an intermediate image representation 
used for the fast calculation of region sums [30]. The con- 
cept has been extended to image sequences [10], where the 
integral images are stacked to form an integral video, and 
can be used to compute spatio-temporal region sums in constant
time. For a video V, its integral video IV is defined as:

$$IV(x',y',t') = \sum_{x \le x'} \sum_{y \le y'} \sum_{t \le t'} V(x,y,t) \tag{3}$$

Tuzel et al. [29] used the integral image representations
for fast calculation of flat region covariances. Here we extend
the idea for fast calculation of covariance matrices inside
a spatio-temporal window using the integral video representation.
The (i, j)-th element of the covariance matrix
defined in (2) can be expressed as:

$$\mathrm{Cov3D}_R(i,j) = \frac{1}{S-1} \left[ \sum_{k=1}^{S} z_k(i)\, z_k(j) - \frac{1}{S} \sum_{k=1}^{S} z_k(i) \sum_{k=1}^{S} z_k(j) \right] \tag{4}$$

where $z_k(i)$ refers to the i-th element of the k-th vector.
To find the covariance in a given spatio-temporal window
R, we have to compute the sum of each feature dimension,
$z(i)_{i=1 \ldots d}$, as well as the sum of the multiplication of
any two feature dimensions, $z(i)z(j)_{i,j=1 \ldots d}$. With d
representing the number of dimensions, the covariance of any
spatio-temporal window can be computed in $O(d^2)$ time, as
follows.



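We need to compute a total of $d + d^2$ integral videos. Let
P be the $W \times H \times T \times d$ tensor of the integral videos:

$$P(x',y',t',i) = \sum_{x \le x'} \sum_{y \le y'} \sum_{t \le t'} F(x,y,t)(i), \quad i = 1, \ldots, d \tag{5}$$

where $F(x,y,t)(i)$ is the i-th element of vector $F(x,y,t)$.
Furthermore, let Q be the $W \times H \times T \times d \times d$ tensor of the
second-order integral videos:

$$Q(x',y',t',i,j) = \sum_{x \le x'} \sum_{y \le y'} \sum_{t \le t'} F(x,y,t)(i)\, F(x,y,t)(j) \tag{6}$$

for $i, j = 1, \ldots, d$. The complexity of calculating the tensors
is $O(d^2 WHT)$. The d-dimensional feature vector $p_{x,y,t}$ and
the $d \times d$ dimensional matrix $Q_{x,y,t}$ can be obtained from
the above tensors using:

$$p_{x,y,t} = [\, P(x,y,t,1), \ldots, P(x,y,t,d) \,]^T \tag{7}$$

$$Q_{x,y,t} = \begin{pmatrix} Q(x,y,t,1,1) & \cdots & Q(x,y,t,1,d) \\ \vdots & \ddots & \vdots \\ Q(x,y,t,d,1) & \cdots & Q(x,y,t,d,d) \end{pmatrix} \tag{8}$$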

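Let $R(x_1,y_1,t_1;\, x_2,y_2,t_2)$ be the spatio-temporal window
of points $\{(x,y,t) \mid x_1 \le x \le x_2,\ y_1 \le y \le y_2,\ t_1 \le t \le t_2\}$,
as shown in Fig. 2. The covariance of the spatio-temporal
window bounded by (0, 0, 0) and (x, y, t) is:

$$\mathrm{Cov3D}_{R(0,0,0;\,x,y,t)} = \frac{1}{S-1} \left[ Q_{x,y,t} - \frac{1}{S}\, p_{x,y,t}\, p_{x,y,t}^T \right] \tag{9}$$

where $S = x \cdot y \cdot t$. Similarly, after a few rearrangements, the
covariance of the region $R(x_1,y_1,t_1;\, x_2,y_2,t_2)$ can be computed as:

$$\begin{aligned}
\mathrm{Cov3D}_{R(x_1,y_1,t_1;\,x_2,y_2,t_2)} = \frac{1}{S-1} \Big[\, & Q_{x_2,y_2} + Q_{x_1-1,y_1-1} - Q_{x_2,y_1-1} - Q_{x_1-1,y_2} \\
& - \frac{1}{S} \left( p_{x_2,y_2} + p_{x_1-1,y_1-1} - p_{x_2,y_1-1} - p_{x_1-1,y_2} \right) \\
& \quad\; \left( p_{x_2,y_2} + p_{x_1-1,y_1-1} - p_{x_2,y_1-1} - p_{x_1-1,y_2} \right)^T \Big]
\end{aligned} \tag{10}$$

where

$$p_{x,y} = p_{x,y,t_2} - p_{x,y,t_1-1} \tag{11}$$

$$Q_{x,y} = Q_{x,y,t_2} - Q_{x,y,t_1-1} \tag{12}$$

and $S = (x_2 - x_1 + 1) \cdot (y_2 - y_1 + 1) \cdot (t_2 - t_1 + 1)$.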
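The constant-time lookup can be sketched with cumulative sums. This is an illustrative implementation under stated assumptions rather than the authors' code: the tensors are zero-padded by one element at the origin so that the `x1 - 1`, `y1 - 1` and `t1 - 1` terms need no special handling at the borders, the region sum is written as a single 3D inclusion-exclusion over eight corners, and all function names are hypothetical.

```python
import numpy as np

def integral_videos(F):
    """First-order (P) and second-order (Q) integral videos of a (W, H, T, d) array."""
    W, H, T, d = F.shape
    prod = F[..., :, None] * F[..., None, :]           # (W, H, T, d, d) products
    P = np.zeros((W + 1, H + 1, T + 1, d))             # zero padding at the origin
    Q = np.zeros((W + 1, H + 1, T + 1, d, d))
    P[1:, 1:, 1:] = F.cumsum(0).cumsum(1).cumsum(2)
    Q[1:, 1:, 1:] = prod.cumsum(0).cumsum(1).cumsum(2)
    return P, Q

def box_sum(I, x1, y1, t1, x2, y2, t2):
    """Sum over an inclusive window using 8 lookups (3D inclusion-exclusion)."""
    a, b, c = x2 + 1, y2 + 1, t2 + 1
    return (I[a, b, c] - I[x1, b, c] - I[a, y1, c] - I[a, b, t1]
            + I[x1, y1, c] + I[x1, b, t1] + I[a, y1, t1] - I[x1, y1, t1])

def cov3d_fast(P, Q, window):
    """Covariance of a window from the integral videos, in O(d^2) per window."""
    x1, y1, t1, x2, y2, t2 = window
    S = (x2 - x1 + 1) * (y2 - y1 + 1) * (t2 - t1 + 1)
    p = box_sum(P, *window)                            # sum of feature vectors
    q = box_sum(Q, *window)                            # sum of outer products
    return (q - np.outer(p, p) / S) / (S - 1)

rng = np.random.default_rng(1)
F = rng.normal(size=(9, 8, 7, 4))
P, Q = integral_videos(F)
C_fast = cov3d_fast(P, Q, (2, 1, 3, 6, 5, 6))
```

The integral videos are built once per video; every candidate window examined during boosting then costs a fixed number of lookups regardless of its size.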
Figure 2. Integral feature video. The spatio-temporal window R
is bounded by $(x_1, y_1, t_1)$ and $(x_2, y_2, t_2)$. Each point in F is a
d-dimensional vector, where d is the number of features.



2.2. Features and regions 

Commonly used features for action and gesture recogni- 
tion include intensity gradients and optical flow. Previous 
studies have shown the benefit of combining both types of 
features [4, 31]. We define the feature mapping present 
in (1), as the following combination of gradient and optical- 
flow based features, extracted from pixel location (x, y, t): 



$$\Phi(V,x,y,t) = [\; x \;\; y \;\; t \;\; g \;\; o \;]^T \tag{13}$$

where g (14) and o (15) are the six-dimensional vectors of
gradient-based and optical-flow based features, respectively.

The first four gradient based features in (14) represent
the first and second order intensity gradients at pixel location
(x, y). The last two gradient based features correspond
to the gradient magnitude and gradient orientation. The
optical-flow based features in (15) represent, in order: the
horizontal and vertical components of the flow vector, the
first order derivatives of the flow components with respect
to t, and the spatial divergence and vorticity of the flow field
as defined in [1]. Each descriptor is hence a 15 x 15 matrix,
as $\Phi(V,x,y,t)$ has 15 dimensions.

For reliable recognition, several regions (and hence sev- 
eral descriptors) are typically used. Fig. 3 shows the spatio- 
temporal windows of two descriptors which can be used for 
recognition of face expressions. With the defined mapping, 
the input video V is mapped to F, a 15-dimensional feature 
video. Since the cardinality of the set of spatio-temporal 
windows $\{R \subset F\}$ is very large, we only consider windows
of a minimum size and increment their location and size by 
a minimum interval value. Further specifics on the windows 
used in the experiments are given in Section 4. 

Following [29], each covariance descriptor $\mathrm{Cov3D}_R$ is
normalised with respect to the covariance descriptor of the
region containing the full feature video, $\mathrm{Cov3D}_F$, to improve
the robustness against illumination variations:

$$\widehat{\mathrm{Cov3D}}_R = \mathrm{diag}(\mathrm{Cov3D}_F)^{-\frac{1}{2}} \; \mathrm{Cov3D}_R \; \mathrm{diag}(\mathrm{Cov3D}_F)^{-\frac{1}{2}} \tag{16}$$

where $\mathrm{diag}(\mathrm{Cov3D}_F)$ is equal to $\mathrm{Cov3D}_F$ at the diagonal
entries and the rest is set to zero.
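The normalisation in Eq. (16) amounts to dividing element (i, j) of the window descriptor by the product of the standard deviations of features i and j over the full video. A small sketch, with the hypothetical function name `normalise_descriptor`:

```python
import numpy as np

def normalise_descriptor(cov_r, cov_f):
    """Eq. (16): pre- and post-multiply cov_r by diag(cov_f)^(-1/2)."""
    inv_sqrt = 1.0 / np.sqrt(np.diag(cov_f))
    # Scaling rows and columns is equivalent to the two diagonal matrix products.
    return cov_r * np.outer(inv_sqrt, inv_sqrt)

rng = np.random.default_rng(2)
samples = rng.normal(size=(200, 4)) * np.array([1.0, 5.0, 0.1, 20.0])
cov_f = np.cov(samples, rowvar=False)          # stands in for the full-video descriptor
cov_r = np.cov(samples[:50], rowvar=False)     # stands in for a window descriptor
cov_r_hat = normalise_descriptor(cov_r, cov_f)
```

Normalising the full-video descriptor by itself yields a matrix with a unit diagonal, which is a quick sanity check on an implementation.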



Figure 3. Two examples of 
Cov3D windows that, together, can 
be useful for the recognition of face 
expressions. 




3. Classification of Actions and Gestures 

The Cov3D descriptors are symmetric positive definite
matrices of size $d \times d$, which can be formulated as a connected
Riemannian manifold ($Sym_d^+$) [7]. In this section we
first briefly overview Riemannian manifolds, and then describe
the proposed weighted Riemannian locality preserving
projection (WRLPP), which allows mapping from Riemannian
manifolds to Euclidean spaces. We then describe
a classification algorithm that uses WRLPP.

3.1. Riemannian manifolds 

A manifold can be considered as a continuous surface 
lying in a higher dimensional Euclidean space. Formally, a 
manifold is a topological space which is locally similar to 
a Euclidean space [29]. Intuitively, the tangent space $T_X$ is
the plane tangent to the surface of the manifold at point X.

A point Y on the manifold can be mapped to a vector
in the tangent space $T_X$ using the logarithm map operator
$\log_X$. For $Sym_d^+$ the logarithm map is defined as:

$$\log_X(Y) = X^{\frac{1}{2}} \log\left( X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}} \right) X^{\frac{1}{2}} \tag{17}$$

where $\log(\cdot)$ is the matrix logarithm operator. Given
the eigenvalue decomposition of a symmetric matrix,
$\Sigma = U D U^T$, the matrix logarithm can be computed via:

$$\log(\Sigma) = U \log(D)\, U^T \tag{18}$$

where $\log(D)$ is a diagonal matrix, with each diagonal element
equal to the logarithm of the corresponding element
in D.

The minimum length curve connecting two points on the
manifold is called the geodesic, and the distance between
two points is given by the length of this curve. Geodesics
are related to the tangents in the tangent space. For $Sym_d^+$,
the distance between two points on the manifold can be
found via:

$$d^2(X,Y) = \mathrm{trace}\left\{ \log^2\left( X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}} \right) \right\} \tag{19}$$

3.2. Weighted RLPP 

The usual approach for classification on manifolds is to
first map the points into an appropriate Euclidean representation [29]
and then use traditional machine learning methods.
Points in the manifold can be mapped into a fixed
tangent space (such as $T_I$, where I is the identity matrix) [6].
Since distances in the manifold are only locally preserved
in the tangent space, better results can be achieved
by considering the tangent space at the Karcher mean, the
point which minimises the distances among the samples, as
shown in [29]. Improved results have been obtained by considering
multiple tangent spaces [19, 25]. A more complex
approach involves using training data to create a mapping
that tries to preserve the relations between points, such as
the RLPP approach [8].

RLPP is based on Laplacian eigenmaps [2]. Given
training points $\mathbb{X} = \{X_1, X_2, \ldots, X_N\}$ from the underlying
Riemannian manifold $\mathcal{M}$, the local geometrical structure
of $\mathcal{M}$ can be modelled by building an adjacency graph G.
The simplest form of G is a binary graph obtained based on
the nearest neighbour properties of Riemannian points: two
nodes are connected by an edge if one node is among the k
nearest neighbours of the other node. From the adjacency
graph G we can find the degree and Laplacian matrices,
respectively:

$$D(i,i) = \sum_{j} G(i,j) \tag{20}$$

$$L = D - G \tag{21}$$

where the degree matrix D is a diagonal matrix of size
$N \times N$, with diagonal entries indicating the number of
edges of each node in the adjacency graph.

RLPP also uses a heat pseudo-kernel matrix K, with the
(i, j)-th element constructed via:

$$K(i,j) = k(X_i, X_j) = \exp\left\{ -\frac{d^2(X_i, X_j)}{\sigma} \right\} \tag{22}$$

where $d(\cdot,\cdot)$ is the geodesic distance defined in (19).

The final mapping can be found through the following
generalised eigenvalue problem [8]:

$$K L K^T A = \lambda\, K D K^T A \tag{23}$$

where the eigenvectors with the r smallest eigenvalues form
the projection matrix A.

The number of possible Cov3D descriptors inside a sam- 
ple video is very large. As such, we elected to use boosting 
to search for a subset of the best descriptors for classifica- 
tion. We could use the original RLPP mapping approach 
to map the matrices as vectors at each boosting iteration. 
However, as shown in [29], the sample weights can be used 
to generate a mapping which is more appropriate for the 
critical training samples. Therefore, we propose a modified 
projection, specifically designed to be used during boosting, 
which uses sample weights to generate the final mapping. 
We refer to this approach as weighted Riemannian locality 
preserving projection (WRLPP). 

In the modified projection, the adjacency graph G is replaced
with a weighted adjacency graph $\hat{G}$, defined as:

$$\hat{G} = W G W \tag{24}$$

where W is a diagonal matrix with diagonal values that
correspond to the vector of sample weights $[w_1, w_2, \ldots, w_N]$.
Using the weighted adjacency graph, edges involving critical
samples (i.e. samples with higher weights) become
more important and their geometrical structure is better
preserved. The modified projection approach is detailed
in Algorithm 1.

Once the projection matrix A has been obtained, a
given point C (a Cov3D matrix) on the manifold can then
be mapped to Euclidean space via:

$$\mathrm{WRLPP}(C) = A^T K_C \tag{25}$$

where $K_C = [\, k(X_1, C),\ k(X_2, C),\ \ldots,\ k(X_N, C) \,]^T$, with
$k(\cdot,\cdot)$ defined in (22), and $\{X_i\}_{i=1}^{N}$ representing the training
points.



Algorithm 1: Obtaining weighted RLPP

Input: Training samples (covariance matrices), labels and weights $\{(X_i, y_i, w_i)\}_{i=1}^{N}$

• Create Riemannian pseudo-kernel matrix:

  $K(i,j) = \exp\left\{ -d^2(X_i, X_j) / \sigma \right\}$, with $d^2(\cdot,\cdot)$ computed using (19)

• Construct weighted adjacency graph:

  $\hat{G}(i,j) = w_i \cdot w_j$ if $y_i = y_j$ and $X_j$ is among the
  k nearest neighbours of $X_i$ in K; $\hat{G}(i,j) = 0$ otherwise

• Obtain the weighted degree $N \times N$ diagonal matrix:

  $\hat{D}(i,i) = \sum_{j} \hat{G}(i,j)$

• Calculate the weighted Laplacian matrix:

  $\hat{L} = \hat{D} - \hat{G}$

• The eigenvectors with the r smallest eigenvalues of the Rayleigh
  quotient $\frac{A^T K \hat{L} K^T A}{A^T K \hat{D} K^T A}$ form the projection matrix A.

Output: Projection model $\mathcal{A} = \{A, \{X_i\}_{i=1}^{N}\}$
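The steps of Algorithm 1 can be sketched in NumPy as follows. Several choices here are assumptions rather than details from the paper: the kernel bandwidth `sigma`, a symmetric neighbourhood graph (an edge is added when either point is among the other's k nearest neighbours), a small ridge `eps` added to the right-hand matrix so that it is safely invertible, and solving the generalised eigenproblem by whitening. All function names are hypothetical.

```python
import numpy as np

def geodesic_dist_sq(X, Y):
    """Squared geodesic distance between SPD matrices, Eq. (19)."""
    w, U = np.linalg.eigh(X)
    X_ihalf = (U / np.sqrt(w)) @ U.T
    lam = np.linalg.eigvalsh(X_ihalf @ Y @ X_ihalf)
    return float(np.sum(np.log(lam) ** 2))

def kernel_matrix(points, sigma=1.0):
    """Pseudo-kernel matrix of Eq. (22); sigma is an assumed bandwidth."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-geodesic_dist_sq(points[i], points[j]) / sigma)
    return K

def wrlpp_fit(K, labels, weights, k=3, r=2, eps=1e-8):
    """Return the (N, r) projection matrix A of the weighted projection."""
    N = K.shape[0]
    G = np.zeros((N, N))
    for i in range(N):
        # Nearest neighbours of X_i are the largest kernel values, excluding i itself.
        neighbours = [j for j in np.argsort(-K[i]) if j != i][:k]
        for j in neighbours:
            if labels[i] == labels[j]:
                G[i, j] = G[j, i] = weights[i] * weights[j]
    D = np.diag(G.sum(axis=1))                   # weighted degree matrix
    L = D - G                                    # weighted Laplacian
    lhs = K @ L @ K.T
    rhs = K @ D @ K.T + eps * np.eye(N)          # ridge term: assumption for stability
    wb, Ub = np.linalg.eigh(rhs)
    rhs_ihalf = (Ub / np.sqrt(np.maximum(wb, eps))) @ Ub.T
    M = rhs_ihalf @ lhs @ rhs_ihalf              # whitened, ordinary eigenproblem
    _, V = np.linalg.eigh((M + M.T) / 2)         # eigenvalues in ascending order
    return rhs_ihalf @ V[:, :r]                  # r smallest eigenvalues

def wrlpp_project(A, train_points, C, sigma=1.0):
    """Eq. (25): map an SPD matrix C to an r-dimensional vector."""
    k_c = np.array([np.exp(-geodesic_dist_sq(X, C) / sigma) for X in train_points])
    return A.T @ k_c

rng = np.random.default_rng(4)
def random_spd(d):
    a = rng.normal(size=(d, d))
    return a @ a.T + d * np.eye(d)

points = [random_spd(3) for _ in range(12)]
labels = np.array([0] * 6 + [1] * 6)
weights = np.full(12, 1.0 / 12)
K = kernel_matrix(points)
A = wrlpp_fit(K, labels, weights)
v = wrlpp_project(A, points, points[0])
```

In a boosting loop the kernel matrix for a given window does not depend on the sample weights, so it could be cached and only the graph and eigenproblem recomputed per iteration.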



3.3. Classification 

As mentioned in the preceding section, we have chosen 
to use boosting to find a subset of the best descriptors for 
classification, as the number of possible Cov3D descrip- 
tors inside a sample video is large. For simplicity, we used 
a combination of one-vs-one LogitBoost classifiers [5] to 
achieve multiclass classification. 

We start with a brief description of binary LogitBoost
classification, with class labels $y_i \in \{0, 1\}$. The probability
of sample x belonging to class 1 is represented by:

$$p(x) = \frac{\exp\{F(x)\}}{\exp\{F(x)\} + \exp\{-F(x)\}} \tag{26}$$

where $F(x) = \frac{1}{2} \sum_{m=1}^{M} g_m(x)$, with $g_m(x)$ representing a weak
learner.

The LogitBoost algorithm learns a set of M weak learners
by minimising the negative binomial log likelihood of
the data. A weighted least squares regression $g_m(x)$ of training
points $x_i \in \mathbb{R}^n$ is fitted to response values $z_i \in \mathbb{R}$, with
weights $w_i$, where

$$w_i = p(x_i)\,(1 - p(x_i)) \tag{27}$$

$$z_i = \frac{y_i - p(x_i)}{p(x_i)\,(1 - p(x_i))} \tag{28}$$

As we are using Cov3D descriptors (covariance matrices)
as input data, we adapt the weak learners $g_m(\cdot)$ to
use the projected descriptors. In other words, $g_m(x)$ is replaced
with $g_m(\mathrm{WRLPP}(X))$, with X representing a covariance
matrix.
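The binary LogitBoost updates in Eqs. (26) to (28) can be illustrated on plain vectors, which stand in here for descriptors that have already been projected. The linear weak learner with a bias term, the fixed number of rounds, and the clipping of weights and responses are assumptions added for numerical stability; none of them is specified in the text above.

```python
import numpy as np

def logitboost_fit(x, y, n_rounds=10):
    """Binary LogitBoost with weighted least-squares linear weak learners.

    x: (N, n) vectors, y: (N,) labels in {0, 1}. Returns one coefficient vector per round.
    """
    N = x.shape[0]
    xb = np.hstack([x, np.ones((N, 1))])               # bias column
    F = np.zeros(N)
    p = np.full(N, 0.5)
    learners = []
    for _ in range(n_rounds):
        w = np.clip(p * (1.0 - p), 1e-10, None)        # Eq. (27), clipped (assumption)
        z = np.clip((y - p) / w, -4.0, 4.0)            # Eq. (28), clipped (assumption)
        sw = np.sqrt(w)
        # Weighted least squares: minimise sum_i w_i (z_i - x_i . beta)^2
        beta, *_ = np.linalg.lstsq(xb * sw[:, None], z * sw, rcond=None)
        learners.append(beta)
        F += 0.5 * (xb @ beta)
        p = 1.0 / (1.0 + np.exp(-2.0 * F))             # Eq. (26) rewritten
    return learners

def logitboost_predict_proba(learners, x):
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    F = 0.5 * sum(xb @ beta for beta in learners)
    return 1.0 / (1.0 + np.exp(-2.0 * F))

rng = np.random.default_rng(5)
x = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
learners = logitboost_fit(x, y)
proba = logitboost_predict_proba(learners, x)
```

Samples whose probability is close to 0.5 receive the largest weights in Eq. (27), which is why the weights identify the critical samples that the weighted projection is meant to emphasise.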

For every unique pair of classes, we train a one-vs-one
LogitBoost classifier as follows. Only the samples belonging
to the pair of classes are used for training the binary
classifier. One class is selected to be the positive class and
the other as the negative class. For each boosting iteration,
we search for the region whose Cov3D descriptor best separates
positive from negative samples. The descriptor is calculated
for all the training samples and mapped to vector
space with WRLPP, using the sample weights calculated for
the current boosting iteration. Once in vector space, we fit
a linear regression and use it as the weak LogitBoost classifier.

Algorithm 2: Boosting with WRLPP

Input: Training videos with labels $\{(V_i, y_i)\}_{i=1 \ldots N}$ belonging to $N_c$ classes

• For each unique pair of class labels $\langle k,l \rangle$ train the one-vs-one
  classifier $C_{\langle k,l \rangle}$

  - Restrict the training set to $\{(V_j, y_j)\}_{j=1 \ldots N \,\mid\, y_j \in \{k,l\}}$

  - Let either k or l be the positive label $y_{\langle k,l \rangle}$

  - Create binary labels $y'_j = (y_j = y_{\langle k,l \rangle})$

  - Start with $w_j = 1/N$, $F(V) = 0$, $p(V_j) = \frac{1}{2}$, $m = 1$

  - Repeat while $p(V_p) - p(V_n) < margin$

    * Compute the response values and weights

      $z_j = \frac{y'_j - p(V_j)}{p(V_j)(1 - p(V_j))}$, $\quad w_j = p(V_j)(1 - p(V_j))$

    * For each spatio-temporal window $R_s$

      · Construct the descriptors $X_{j,s} = \mathrm{Cov3D}_{j,R_s}$

      · From $\{(X_{j,s}, y'_j, w_j)\}_{j=1}^{N}$ obtain the projection
        model $\mathrm{WRLPP}_s$ using Algorithm 1

      · Map the data points $x_{j,s} = \mathrm{WRLPP}_s(X_{j,s})$
        using (25)

      · Fit function $g_s(x)$ by weighted least-squares
        regression of $z_j$ to $x_{j,s}$ using weights $w_j$

    * Update $F(V) \leftarrow F(V) + \frac{1}{2} g_m(V)$, where $g_m$ is the
      best classifier among $\{g_s\}$ which minimises the negative
      binomial log-likelihood

      $-\sum_{j=1}^{N} \left[ y'_j \log(p(x_j)) + (1 - y'_j) \log(1 - p(x_j)) \right]$

    * Update $p(V) \leftarrow \frac{e^{F(V)}}{e^{F(V)} + e^{-F(V)}}$

    * Sort positive and negative samples according to
      descending probabilities and find samples at the decision
      boundaries $V_p = (dr \cdot N_p)$-th $V^+$, $V_n = (rr \cdot N_n)$-th
      $V^-$, where dr and rr are the desired detection and false
      positive rejection rates

    * $m \leftarrow m + 1$

  - Store $C_{\langle k,l \rangle} = \{(R_m, \mathrm{WRLPP}_m, g_m)\}_{m=1}^{M}$, threshold
    $\tau_{\langle k,l \rangle} = F(V_n)$ and positive label $y_{\langle k,l \rangle}$

Output: A set of $\frac{N_c (N_c - 1)}{2}$ one-vs-one classifiers

To prevent overfitting, the number of weak classifiers
in each one-vs-one classifier is controlled by a probability
margin between the last accepted positive and the last
rejected negative. Both margin samples are determined by
the target detection rate (dr) and the target false positive
rejection rate (rr). The final multiclass classifier is a set of
one-vs-one classifiers. Each one-vs-one classifier $C_{\langle k,l \rangle}$,
where k and l are the labels of its two classes, has a positive
class $y_{\langle k,l \rangle}$ and a threshold $\tau_{\langle k,l \rangle}$. The positive class is the
label of the class deemed to be positive and the threshold
is found via boosting. Algorithm 2 summarises the training
process.



Dataset          UCF [24]    CK+ [18]             Cambridge [12]
Type             sports      facial expressions   hand gestures
Classes          10          7                    9
Subjects         -           123                  2
Scenarios        -           -                    5
Video samples    150         593                  900
Resolution       variable    640 x 480            320 x 240

Table 1. Overview of the datasets used in the experiments.



A sample video V is classified as follows. Given a one-vs-one
classifier $C_{\langle k,l \rangle}$, the probability of a sample video
V belonging to the positive class $y_{\langle k,l \rangle}$ is evaluated using:

$$C_{\langle k,l \rangle}(V) = \sum_{m=1}^{M} g_m\left( \mathrm{WRLPP}_m\left( \mathrm{Cov3D}_{R_m} \right) \right) - \tau_{\langle k,l \rangle} \tag{29}$$

After evaluating V with all the one-vs-one classifiers in
the set, the sample is labelled as the class a which maximises:

$$C(V) = \arg\max_{a} \sum_{l} C_{\langle a,l \rangle}(V)\; \mathrm{sign}_a\left( C_{\langle a,l \rangle}(V) \right) \tag{30}$$

where $\mathrm{sign}_a(C_{\langle a,l \rangle}(V))$ is $\mathrm{sign}(C_{\langle a,l \rangle}(V))$ if a is the positive
class $y_{\langle a,l \rangle}$, or $1 - \mathrm{sign}(C_{\langle a,l \rangle}(V))$ otherwise. In other
words, V is labelled as the class with greater probability
sum, selecting all the one-vs-one classifiers that evaluate to
that class.

4. Experiments 

We compared the performance of the proposed algorithm 
against baseline approaches as well as several state-of-the- 
art methods. We used three benchmark datasets, with an 
overview of the datasets shown in Table 1.

In the following subsections, we first present an evalua- 
tion of several Riemannian to Euclidean space mapping ap- 
proaches, justifying the use of the weighted RLPP. We then 
follow with experiments showing the performance on sport 
actions, facial expressions and hand gestures. 

Unless otherwise stated, no pre-processing was performed
on the input sequences and all the recognition results
were obtained using 5-fold cross validation to divide
the samples into training and testing sets.

In all cases we used the following parameters: 0.95 de- 
tection rate, 0.95 false positive rejection rate, 0.5 margin. 
Furthermore, since the search space of spatio-temporal win- 
dows is very large, we restricted the minimum size of the 
windows, as well as the minimum increment on location 
and size of the windows, to | of the frame size. 

4.1. Comparison of mapping approaches 

In Fig. 4, we compare the following six Riemannian to
Euclidean space mapping ($Sym_d^+ \to \mathbb{R}$) approaches which
can be used during boosting: (i) no mapping (i.e., using a
vectorised representation of the upper-triangle of the covariance
matrix), (ii) projection to a fixed tangent space [6],
(iii) projection to the weighted Karcher mean of the samples [29],
(iv) projection using k-tangent spaces [25],
(v) mapping the points with the original RLPP method [8],
and (vi) mapping the points with the proposed WRLPP approach.

Figure 4. Performance comparison of various $Sym_d^+ \to \mathbb{R}$ mapping
approaches, used within the classifier framework described in
Section 3.3. (Curves: Vector space, Identity, RLPP, Weighted Mean,
3-Tangent Spaces, WRLPP. Horizontal axis: false positives per window.)

Since the mapping approach affects individual binary 
classifiers, we show results per classifier with detection er- 
ror trade-off curves. We chose the one-vs-one classifiers be- 
tween conflicting class pairs (where samples of one class are 
misclassified as the other class) on the Cambridge hand ges- 
ture recognition dataset (which is described in Section 4.4). 
Each point on the curve represents the average of all the 
chosen classifiers. The curves were obtained by varying the 
classification threshold $\tau$ in Algorithm 2.

With the exception of the original RLPP method, incre- 
mentally better results are obtained by using the mapping 
approaches in the mentioned order, as they provide increas- 
ingly better vector representations of the manifold space. 
Although RLPP is designed to provide a better representa- 
tion compared to tangent-based approaches, it appears not 
to be appropriate for boosting as it does not take into ac- 
count the sample weights of critical training points. The 
proposed WRLPP method addresses this problem, resulting 
in the best overall performance. 

4.2. UCF sport dataset 

The UCF sport action dataset [24] consists of ten cat- 
egories of human actions, containing videos with non- 
uniform backgrounds where both the camera and the subject 
might be moving. We use the regions of interest provided 
with the dataset. 

We compared the Cov3D approach against the following
methods: HOG3D [31], hierarchy of discriminative
space-time neighbourhood features (HDN) [13], and augmented
features in conjunction with multiple kernel learning
(AFMKL) [32]. HOG3D is the extension of the histogram
of oriented gradient descriptor [15] to the spatio-temporal
case. HDN learns shapes of space-time feature
neighbourhoods that are most discriminative for a given
action category. The idea is to form new features composed
of the neighbourhoods around the interest points in
a video. AFMKL exploits appearance distribution features
and spatio-temporal context features in a learning scheme
for action recognition. As shown in Table 2, the proposed
Cov3D-based approach achieves the highest accuracy.

Method        Performance
HOG3D [31]    85.60%
HDN [13]      87.27%
AFMKL [32]    91.30%
Cov3D         93.91%

Table 2. Average recognition rate on the UCF dataset [24].

4.3. CK+ facial expression dataset 

The extended Cohn-Kanade (CK+) facial expression 
database [18] contains 593 sequences from 123 subjects. 
We used the sequences with validated emotion labels, 
among 7 possible emotions. The image sequences vary in 
duration (i.e. 10 to 60 frames) and incorporate the onset 
(which is also the neutral frame) to peak formation of the 
facial expressions. 

We compared the Cov3D approach against active appearance
models (AAM), constrained local models (CLM) [3],
and temporal modelling of shapes (TMS) [9]. AAM is the
baseline approach included with the dataset. It uses active
appearance models to track the faces and extract the
features, and then uses support vector machines (SVM)
to classify the facial expressions. The CLM approach is
an improvement on AAM, designed for better generalisation
to unseen objects. The TMS approach uses latent-dynamic
conditional random fields to model temporal variations
within shapes.

We show the performance per emotion in Table 3, in
line with existing literature. The proposed Cov3D approach
achieves the highest average recognition accuracy of 92.3%
(averaged over the 7 classes). The next best method (TMS)
obtained an average accuracy of 87.92%.

4.4. Cambridge hand gesture dataset 

The Cambridge hand-gesture dataset [12] consists of 900
image sequences of 9 gesture classes. Each class has 100
image sequences performed by 2 subjects, captured under
5 illuminations and 10 arbitrary motions. The 9 classes are
defined by three primitive hand shapes and three primitive
motions. Each sequence was recorded with a fixed camera
having roughly isolated gestures in space and time. We
followed the test protocol defined in [12]. Sequences with
normal illumination were considered for training while tests
were performed on the remaining sequences.

Method      Anger   Contempt   Disgust   Fear   Happy   Sadness   Surprise
AAM [18]    75.0    84.4       94.7      65.2   100     68.0      96.0
CLM [3]     70.1    52.4       92.5      72.1   94.2    45.9      93.6
TMS [9]     76.7    -          81.5      94.4   98.6    77.2      99.1
Cov3D       94.4    100        95.5      90.0   96.2    70.0      100

Table 3. Recognition rate (in %) on the CK+ dataset [18].

Method       Set1   Set2   Set3   Set4   Overall
TCCA [12]    81%    81%    78%    86%    82% (±3.5)
PM [20]      89%    86%    89%    87%    88% (±2.1)
TB [19]      93%    88%    90%    91%    91% (±2.4)
Cov3D        92%    94%    94%    93%    93% (±1.1)

Table 4. Average recognition rate on the Cambridge dataset [12].

The proposed method was compared against tensor 
canonical correlation analysis (TCCA) [12], product manifolds
(PM) [20] and tangent bundles (TB) [19]. TCCA is the
extension of canonical correlation analysis to multiway data 
arrays or tensors. Canonical correlation analysis and princi- 
pal angles are standard methods for measuring the similarity 
between subspaces. In the PM method a tensor is charac- 
terised as a point on a product manifold and classification 
is performed on this space. The product manifold is created 
by applying a modified high order singular value decompo- 
sition on the tensors and interpreting each factorised space 
as a Grassmann manifold. In the TB method, video data 
is represented as a third order tensor and factorised using 
high order singular value decomposition, where each factor 
is projected onto a tangent space and the intrinsic distance 
is computed from a tangent bundle for action classification. 

We report the recognition rates for the four test sets in 
Table 4, where the proposed Cov3D-based approach obtains 
the highest performance. 

5. Conclusion 

In this paper, we first extended the flat covariance de- 
scriptors proposed in [29] to spatio-temporal covariance de- 
scriptors termed Cov3D, and then showed how they can be 
computed quickly through the use of integral video repre- 
sentations. 

The proposed Cov3D descriptors belong to the group of
symmetric positive definite matrices, which can be formulated
as a connected Riemannian manifold. Prior to classification,
points on a manifold are generally mapped to a
Euclidean space, through a technique such as Riemannian
locality preserving projection (RLPP) [8].



The Cov3D descriptors are extracted from spatio- 
temporal windows inside sample videos, with the number of 
possible windows being very large. We used a boosting ap- 
proach to find a subset which is the most useful for classifi- 
cation. In order to take into account the weights of the train- 
ing samples, we further proposed to extend RLPP by in- 
corporating weighting during the projection. The weighted 
projection (termed WRLPP) leads to a better representa- 
tion of the neighbourhoods around the most critical training 
samples during each boosting iteration. 

Combining the proposed Cov3D descriptors with the 
classification approach based on WRLPP boosting leads to 
a state-of-the-art method for action and gesture recognition. 
The proposed Cov3D-based method performs better than 
several recent approaches on three benchmark datasets for 
action and gesture recognition. The method is robust and 
does not require additional processing of the videos, such 
as foreground detection, interest-point detection or tracking.
To our knowledge, this is the first approach proving to
be equally suitable (i.e., > 90% recognition accuracy) for
both action and gesture recognition.

Further avenues of research include adapting the method 
for related tasks, such as anomaly detection in surveillance 
videos [23], where there is often a shortage of positive ex- 
amples. 